## [1] 4898 12
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality
## 6 :2198
## 5 :1457
## 7 : 880
## 8 : 175
## 4 : 163
## 3 : 20
## (Other): 5
The dataset contains measurements of 11 chemical properties (acidity, sugar, alcohol, etc.) for 4898 white wines. Each wine also has a quality score on a 0-10 scale.
The quality scores are approximately normally distributed. Average wines (scores 5-7) are the majority, there are a few poor-quality wines, and very high quality is extremely rare. So which factors is wine quality related to? The characteristics that matter most in everyday drinking are alcohol concentration and taste (sweetness and sourness).
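As a quick check on the claim that average wines dominate, the counts printed by `summary()` above can be turned into shares. This is only a sketch: the counts are copied by hand from that output, and the name `quality_counts` and the label `other` (for the `(Other)` bucket) are introduced here for illustration.

```r
# Counts copied from the summary() output for the quality factor above.
# "other" stands for the "(Other)" bucket that summary() prints.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "other" = 5)

n_wines <- sum(quality_counts)      # should match the 4898 rows
share <- quality_counts / n_wines   # proportion of wines per score

# Share of "average" wines, i.e. scores 5 to 7
mid_share <- sum(share[c("5", "6", "7")])

print(round(100 * share, 1))
print(round(mid_share, 3))
```

Scores 5 to 7 account for 4535 of the 4898 wines, which is roughly 93% of the dataset.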
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
After adjusting the bin width and the x axis, most white wines have an alcohol concentration between 9.5% and 11%. The median is 10.40 and the mean is 10.51, so they are very close.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Fixed (non-volatile) acidity, measured as tartaric acid, is approximately normally distributed, with a median of 6.8 g/dm^3 and a mean of 6.855 g/dm^3. A box plot with quality on the x axis and fixed acidity on the y axis shows that the vast majority of values are around 6 g/dm^3 and fall between 3 and 9 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The histogram of volatile acidity is right-skewed (it has a long tail toward high values, and the mean, 0.2782, is above the median, 0.26), so I transform the data using a log transform.
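One rough way to see what the log transform does, without the full data, is to compare a quartile-based skewness measure before and after the transform. The sketch below uses the Bowley (quartile) skewness, which is not something the original analysis computed; the helper name `bowley` is hypothetical. It also assumes that the quartiles of the logged data are close to the logs of the quartiles, which holds for a monotone transform up to interpolation between observations.

```r
# Quartiles copied from the summary() output for volatile.acidity above.
q1 <- 0.21
med <- 0.26
q3 <- 0.32

# Bowley (quartile) skewness: 0 when the quartiles are symmetric about
# the median, positive when the upper quartile lies further away.
bowley <- function(q1, med, q3) (q3 + q1 - 2 * med) / (q3 - q1)

skew_raw <- bowley(q1, med, q3)
skew_log <- bowley(log10(q1), log10(med), log10(q3))

print(c(raw = skew_raw, log10 = skew_log))
```

On the raw scale the value is positive, and on the log10 scale it is much closer to zero, which is consistent with the log transform making the middle of the distribution more symmetric. This measure ignores the extreme tail, so a histogram is still the better diagnostic.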
## [1] 7 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Most wines have a citric acid content between 0.2 and 0.4 g/dm^3; 19 wines have a citric acid content of 0, and 7 are above 1 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.130 6.890 7.405 7.467 7.960 14.960
I added a total_acidity column (fixed acidity + volatile acidity + citric acid) to the dataset. Most white wines have a total acidity between 7 and 8 g/dm^3. Faceting by quality with facet_grid gives no indication that wines of any particular quality have lower or higher acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## [1] 18 13
Most residual.sugar values are below 20 g/dm^3. I transformed the long-tailed data to better understand the distribution of residual sugar. The transformed distribution appears bimodal, with peaks at around 2 g/dm^3 and again at around 10 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Most chloride values are between 0.035 and 0.05 g/dm^3. Some are much larger; the maximum is 0.346 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The histogram of free.sulfur.dioxide is approximately normal, with a long right tail. The median is 34.00, the mean is 35.31, and the maximum is 289.00. Most wines have free sulfur dioxide between 25 and 50 mg/dm^3; a few values are larger than 100 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The distribution of total.sulfur.dioxide is approximately normal, with a median of 134.0 and a mean of 138.4. It also has some extreme values: the minimum is 9.0 and the maximum is 440.0. Most total.sulfur.dioxide values are between 105 and 170 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 78.0 100.0 103.1 125.0 331.0
The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. I also want to know the influence of bound sulfur dioxide, so I created a column named 'bound.sulfur.dioxide' (total.sulfur.dioxide - free.sulfur.dioxide). In the probability density curves, the distribution for high-quality wines seems to be shifted to the left, toward lower values.
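The two derived columns described so far (total_acidity and bound.sulfur.dioxide) can be sketched as follows. To keep the example self-contained, it rebuilds only the first ten rows of the relevant columns from the `str()` output above; the real analysis would apply the same two assignments to the full `whites_wine` data frame.

```r
# First ten rows of the relevant columns, copied from the str() output above.
whites_wine <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid          = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Total acidity: sum of the three acid measurements (all in g/dm^3).
whites_wine$total_acidity <- with(
  whites_wine, fixed.acidity + volatile.acidity + citric.acid
)

# Bound SO2: whatever part of the total is not free (mg/dm^3).
whites_wine$bound.sulfur.dioxide <- with(
  whites_wine, total.sulfur.dioxide - free.sulfur.dioxide
)

print(head(whites_wine[, c("total_acidity", "bound.sulfur.dioxide")], 3))
```

For the first wine this gives a total acidity of 7 + 0.27 + 0.36 = 7.63 g/dm^3 and a bound sulfur dioxide of 170 - 45 = 125 mg/dm^3.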
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The distribution of density looks approximately normal. Most density values are between 0.991 and 0.996 g/cm^3; the maximum is 1.039 g/cm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The histogram of pH is approximately normal. The median is 3.18, the mean is 3.188, and the maximum is 3.820.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
## [1] 4898 14
Most white wines have a sulphates value between 0.4 and 0.55 g/dm^3.
Number of Instances: white wine - 4898.
Number of Attributes: 11 + output attribute (quality)
variables:
fixed acidity (tartaric acid - g / dm^3)
volatile acidity (acetic acid - g / dm^3)
citric acid (g / dm^3)
residual sugar (g / dm^3)
chlorides (sodium chloride - g / dm^3)
free sulfur dioxide (mg / dm^3)
total sulfur dioxide (mg / dm^3)
density (g / cm^3)
pH
sulphates (potassium sulphate - g / dm^3)
alcohol (% by volume)
Output variable (based on sensory data):
quality (score between 0 and 10)
The white_wine data structure contains 11 numeric variables describing the chemical composition of white wine (acidity, sugar, chlorides, sulfur dioxide, alcohol content (% by volume), pH, and sulphates), plus quality (a score between 0 and 10).
I am interested in how the pH, residual sugar, and alcohol concentration of white wine influence its quality.
As we know, the acid and chloride content of the wine influences the pH.
I created two new variables from existing variables: total_acidity and bound.sulfur.dioxide.
The minimum volatile.acidity value is 0.08 and the maximum is 1.1, and the distribution is right-skewed, so I applied a log(x) transform to make the distribution easier to observe. I did the same for residual.sugar, transforming the long-tailed data to better understand its distribution.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality total_acidity bound.sulfur.dioxide
## 6 :2198 Min. : 4.130 Min. : 4.0
## 5 :1457 1st Qu.: 6.890 1st Qu.: 78.0
## 7 : 880 Median : 7.405 Median :100.0
## 8 : 175 Mean : 7.467 Mean :103.1
## 4 : 163 3rd Qu.: 7.960 3rd Qu.:125.0
## 3 : 20 Max. :14.960 Max. :331.0
## (Other): 5
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "total_acidity" "bound.sulfur.dioxide"
I checked the correlation coefficients between variables with the ggcorr function.
Alcohol was negatively correlated with density (correlation coefficient -0.78). Residual sugar was positively correlated with density (0.84), and total.sulfur.dioxide was positively correlated with density (0.53).
fixed.acidity was negatively correlated with pH (correlation coefficient -0.43).
Next, I explore relationships between pairs of variables, such as acidity and pH, and alcohol, residual.sugar, total.sulfur.dioxide, quality, and density.
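The correlation step can be sketched with base R's `cor()`, used here as a stand-in for `ggcorr` so the example has no package dependencies. It uses only the ten rows visible in the `str()` output, so the coefficients it prints will not match the full-data values quoted above; the data frame name `head_rows` is introduced for illustration.

```r
# First ten rows of four columns, copied from the str() output above.
head_rows <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pairwise Pearson correlations between all numeric columns.
corr <- cor(head_rows, method = "pearson")
print(round(corr, 2))

# Locate the strongest off-diagonal pair by absolute correlation.
off_diag <- corr
diag(off_diag) <- 0
idx <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]
strongest <- c(rownames(corr)[idx["row"]], colnames(corr)[idx["col"]])
print(strongest)
```

On the full dataset the same call, applied to the numeric columns of `whites_wine`, would give the matrix that `ggcorr` draws as a heat map.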
Acidity is negatively correlated with pH: as fixed acidity increases, pH decreases. The relationship between fixed acidity and pH appears to be linear. Next, I check the relationship between the other acids and pH.
##
## Pearson's product-moment correlation
##
## data: whites_wine$volatile.acidity and whites_wine$pH
## t = -2.2343, df = 4896, p-value = 0.02551
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.059868312 -0.003912409
## sample estimates:
## cor
## -0.03191537
As volatile acidity increases, pH does not change much (r = -0.03).
##
## Pearson's product-moment correlation
##
## data: whites_wine$citric.acid and whites_wine$pH
## t = -11.614, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1908793 -0.1363671
## sample estimates:
## cor
## -0.1637482
The smoothed trend between citric acid and pH is still weak, but citric acid and fixed acidity show a similar trend against pH. Next, I explore the relationship between all the acids combined and pH.
##
## Pearson's product-moment correlation
##
## data: whites_wine$total_acidity and whites_wine$pH
## t = -33.388, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4531918 -0.4075605
## sample estimates:
## cor
## -0.4306513
With total acidity the trend is more continuous: as total acidity increases, pH decreases. The relationship between total acidity and pH appears to be linear.
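The Pearson outputs above show that a small p-value does not by itself mean a strong relationship. As a sketch, the t statistic reported by `cor.test` can be reproduced from the correlation and the degrees of freedom with the standard formula t = r * sqrt(df) / sqrt(1 - r^2), and r^2 gives the share of variance in pH that a linear fit would account for. The helper name `t_from_r` is hypothetical; the r and df values are copied from the outputs above.

```r
# t statistic for a Pearson correlation r with df = n - 2 degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

df <- 4896                 # 4898 wines minus 2
r_volatile <- -0.03191537  # volatile.acidity vs pH
r_total <- -0.4306513      # total_acidity vs pH

t_volatile <- t_from_r(r_volatile, df)
t_total <- t_from_r(r_total, df)

# Two-sided p-value for the volatile acidity test.
p_volatile <- 2 * pt(-abs(t_volatile), df)

# Share of pH variance associated with each predictor in a linear fit.
r2 <- c(volatile = r_volatile^2, total = r_total^2)

print(round(c(t_volatile = t_volatile, t_total = t_total), 3))
print(signif(p_volatile, 4))
print(round(r2, 4))
```

With nearly 4900 observations, even r = -0.03 is statistically significant at the 5% level, yet it corresponds to about 0.1% of the variance in pH, while total acidity (r = -0.43) corresponds to about 18.5%.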
##
## Pearson's product-moment correlation
##
## data: whites_wine$sulphates and whites_wine$pH
## t = 11.047, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1285063 0.1831580
## sample estimates:
## cor
## 0.1559515
Most sulphates values are between 0.4 and 0.5; as sulphates increase, pH does not change much.
##
## Pearson's product-moment correlation
##
## data: whites_wine$alcohol and whites_wine$pH
## t = 8.5601, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.09374446 0.14893205
## sample estimates:
## cor
## 0.1214321
As alcohol increases, pH does not change much.
##
## Pearson's product-moment correlation
##
## data: whites_wine$residual.sugar and whites_wine$pH
## t = -13.847, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2209387 -0.1670352
## sample estimates:
## cor
## -0.1941335
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
When residual sugar is below 5 g/dm^3, pH is more variable. Once residual sugar exceeds about 5 g/dm^3, pH decreases slowly.
The pH of white wine is lower at both ends of the quality scale. It seems that when pH is higher, the quality of white wine depends on other chemical components. I need to explore the chemicals that raise pH.
White wines with high quality scores also have a high median sulphates value.
As residual sugar content increases, white wine density clearly increases.
As alcohol increases, density decreases.
Residual sugar is highest in the middle of the quality range. As residual sugar decreases, it seems that the quality of white wine depends on other chemical components. I need to explore the chemicals associated with lower sugar.
White wine with a high quality score has a higher median alcohol content.
White wine with a high quality score is relatively low in density.
High-quality wines have a higher median pH than average wines, but wines with high pH values also include some of poor quality.
High-quality white wines have lower median sugar levels than average wines, but the low-sugar group also contains poor-quality wines. It may be that too much sugar makes the wine too sweet, while too little causes other problems.
High quality white wines are significantly more alcoholic than regular wines.
I also found that the pH of white wine is associated with acidity and sulphates, and is mainly affected by fixed (non-volatile) acidity: the higher the fixed acidity, the lower the pH. Higher sulphate content is associated with a correspondingly higher pH.
The density of white wine decreases with the increase of alcohol content.
The strongest relationship is that as alcohol increases, white wine density decreases and the quality score rises.
The high quality scores of white wine are concentrated in areas with high alcohol content and low density.
Wines with high quality scores are concentrated in low-sugar, low-density areas.
The quality of the wine is higher in the region with low fixed acidity and higher pH.
The distribution is the same as for fixed (non-volatile) acidity.
The quality score of white wine is higher in the upper left corner.
Lower acidity raises the pH, and quality scores tend to increase.
The density has a lot to do with the quality of white wine.
The quality score of white wine has a lot to do with its density: as density decreases, the score increases. Residual sugar and alcohol content are the main factors affecting density, and thus the quality of the wine.
I did not create models with my dataset.
Most of the higher-quality wines are in the region with low acidity and higher pH. Acidity may affect the pH, which in turn affects the quality of the wine.
High-quality wines have higher levels of alcohol than average wines, but poor-quality wines also have higher alcohol concentrations than average wines, perhaps because other chemical components affect their quality.
The high quality scores of white wine are concentrated in areas with high alcohol content and low density. There is also a dominant relationship between alcohol content and density: as alcohol increases, density decreases.
This exploration only began to examine the relationship between the quality of white wine and its pH, alcohol content, and residual sugar, because these are the characteristics we pay attention to when tasting wine. Only histograms, box plots, and scatter plots were used, and only some rough, surface-level relationships were found. The quality score was set as an ordered factor. I also explored the factors that affect pH and found that acidic components are its main drivers, but I did not find other relevant factors. Exploring alcohol and residual sugar showed that both are related to the density of white wine, so I went on to explore the relationship between density and quality. The quality of the wine is related to pH and density, but I could not find the balance between pH and density that makes a wine better.
For data exploration like this, I think it is necessary to have a clear line of thought and to draw on real-world knowledge, such as the actual characteristics of wine, rather than exploring blindly. At the same time, you cannot start exploring with a conclusion already in mind.
Based on this exploration, future analysis work should start from a clear understanding of the data and a more refined analytical approach. I should also learn to process data, including grouping and reshaping it, to better discover patterns in the data.